feat(daemon): process supervision — llama-server lifecycle + status by OpenCircuitDev · Pull Request #69 · OpenCircuitDev/opencircuitmodel

OpenCircuitDev · 2026-06-11T22:40:56Z

Summary

Track 1, item 2 from docs/AGENT_OPERATIONS.md — activate the dead-code supervisor module. The daemon can now spawn + supervise its own llama-server instead of requiring the user to hand-run it, with health-gated restart, exponential backoff, max-restart budget, and a Tauri status command for the UI to surface failure.

Ollama is intentionally not supervised here (it has its own service installer + lifecycle); the spawn-gate refuses when backend = "ollama", and the module doc explains why.

What changed

Rust (crates/ocm-daemon/)

settings — one new field: llama_server_binary: Option<String>. #[serde(default)] for forward-compat. Doc explicitly notes "no-op when backend = ollama".
supervisor — was already a partial module (Supervisor struct, spawn helpers, wait_for_http_ready); now activated. Added:
- SupervisorStatus enum (Serialize, snake_case tag): NotSpawning / Starting / Running { pid } / Restarting { attempt, last_error } / FailedAfterMaxRestarts { attempts, last_error } / Stopped.
- SupervisorPolicy with defaults: max_restarts=3, initial_backoff=500ms, max_backoff=10s, stability_window=60s, health_check_interval=5s, health_check_timeout=15s.
- compute_backoff(attempt, initial, max) — exponential, u8::MAX-safe (no shift overflow).
- supervise(supervisor, policy, status, shutdown) — health-gated restart loop. Every wait is tokio::select!-raced against the shutdown signal. Stability-window reset means a process that ran healthy then crashed isn't penalized as a flap. tracing::error when the budget is exhausted.
- Module-level doc clarifies Ollama no-spawn (own service installer; we'd fight ollama-svc's restart logic).
- spawn_vllm_server kept (already tested) but explicitly #[allow(dead_code)] — NVIDIA supervision is a separate follow-up.
bootstrap — wire-points:
- should_spawn_llama_supervisor(settings, models_dir) -> bool — the spawn-gate decision: requires backend = LlamaCpp AND llama_server_binary.is_some() AND the model GGUF exists at models_dir/<model_id>.gguf (matching ocm_models::downloader convention). Six test cases cover the matrix.
- build_llama_supervisor(settings, models_dir) -> Option<(Arc<Supervisor>, SupervisorPolicy)> — resolves binary + model path + port (from inference_base_url) + health URL (/v1/models), returns None when the gate refuses.
- LlamaSupervisorState — what main.rs app.manage()s. Holds the shared status Arc<Mutex<_>> plus the shutdown watch::Sender (in Mutex<Option<_>> so a future RunEvent::ExitRequested hook can .take() it). Dropping the state is sufficient for clean shutdown in v0.1.2; signal_shutdown stub is included for the future hook.
commands — get_supervisor_status Tauri command, returns the live SupervisorStatus.
main — supervisor wired into setup(): build → spawn supervise loop on tauri runtime → app.manage(supervisor_state). Falls cleanly back to NotSpawning when the gate refuses. Removed the #[allow(dead_code)] mod supervisor; annotation (it's live now).
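
For reviewers who want the backoff schedule spelled out, here is a minimal std-only sketch of what a `compute_backoff(attempt, initial, max)` with the properties listed above could look like. It assumes `attempt` is 1-based and that the delay doubles per attempt; the body is an illustration, not the code in supervisor.rs.

```rust
use std::time::Duration;

/// Hypothetical reconstruction: delay before restart `attempt` (assumed 1-based),
/// doubling from `initial` and clamped at `max`.
fn compute_backoff(attempt: u8, initial: Duration, max: Duration) -> Duration {
    // attempt 1 -> initial, attempt 2 -> 2x, attempt 3 -> 4x, ...
    let exponent = u32::from(attempt.saturating_sub(1));
    // checked_shl returns None once the shift reaches the bit width (32),
    // so a large attempt number cannot overflow the shift.
    let factor = 1u32.checked_shl(exponent).unwrap_or(u32::MAX);
    // saturating_mul guards the Duration multiplication; min applies the clamp.
    initial.saturating_mul(factor).min(max)
}

fn main() {
    // Policy defaults quoted in the PR description.
    let initial = Duration::from_millis(500);
    let max = Duration::from_secs(10);
    for attempt in [1u8, 2, 3, 6, u8::MAX] {
        println!("attempt {attempt}: {:?}", compute_backoff(attempt, initial, max));
    }
}
```

With the default policy this gives 500ms, 1s, 2s for the three budgeted restarts, so the 10s clamp only matters when a test or future setting raises max_restarts.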

Frontend — Settings interface gains llama_server_binary; settings page gains a path input with placeholder communicating None=do-not-spawn.

TDD audit trail

| Commit | CI Run | Verdict |
| --- | --- | --- |
| `115286e` test: RED | 27379971603 | ❌ expected (compile-fail on missing symbols) |
| `35849fc` feat: GREEN | 27380885416 | ❌ rustfmt drift on 3 multi-line decisions |
| `84e746b` fix: rustfmt + frontend | 27381591386 | ❌ clippy: `unused_assignments`, `dead_code` on intentional-future-use code |
| `44253c4` fix(clippy) | 27381788960 | ✅ Rust ubuntu/macOS/windows, all tests pass |
| (latest, frontend) | 27381591372 | ✅ Frontend CI |

New test coverage (+17 tests)

supervisor.rs (+5): SupervisorStatus::default(); SupervisorPolicy::default() lockstep with constants; compute_backoff schedule (incl. u8::MAX safety); supervise() integration test 1: immediate-exit Command + unresponsive health URL → exactly max_restarts=2 attempts → FailedAfterMaxRestarts; supervise() integration test 2: long-running sleep + early shutdown signal → Stopped + child reaped.
settings.rs (+3): default llama_server_binary == None; TOML round-trip; legacy file (no field) still parses.
bootstrap.rs (+9): full spawn-gate decision matrix (6 cases — yes / Ollama-no / Auto-no / binary-None-no / file-missing-no / model_id-None-no), parse_port helper (2 cases), build_llama_supervisor returns None when gate refuses.
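
The parse_port helper is small enough to sketch. The version below is an assumption about its shape, not the code in bootstrap.rs: it pulls the port out of a base URL using only std string methods, which is the stated reason for not depending on the `url` crate. The example URLs are illustrative.

```rust
/// Hypothetical sketch: extract the port from a base URL such as
/// "http://127.0.0.1:8080/v1" without the `url` crate.
fn parse_port(base_url: &str) -> Option<u16> {
    // Drop the scheme if present ("http://", "https://").
    let rest = base_url.split_once("://").map_or(base_url, |(_, r)| r);
    // Keep only the authority: everything before the first '/'.
    let authority = rest.split('/').next()?;
    // The port is whatever follows the last ':' in the authority.
    let (_, port) = authority.rsplit_once(':')?;
    // Non-numeric or out-of-range values yield None instead of panicking.
    port.parse().ok()
}

fn main() {
    for url in ["http://127.0.0.1:8080/v1", "http://localhost", "http://[::1]:9000"] {
        println!("{url} -> {:?}", parse_port(url));
    }
}
```

Returning `Option<u16>` leaves the "no explicit port" decision to the caller, which fits a gate that already returns None when it cannot resolve everything it needs.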

ocm-daemon test count: 25 → 42 (verified in run 27381788960 log).

Design choices made

Per AGENT_OPERATIONS "NEEDS_APPROVAL when not covered by spec": the operator declined the multi-choice up front, so I took my own recommended paths and documented them:

Status surface: new Tauri command get_supervisor_status + tracing::error on FailedAfterMaxRestarts. No tray-icon hint, no UI panel in this PR (UI polish = Track 1 item 3).
ctx_len: hardcoded DEFAULT_LLAMA_CTX_LEN = 4096 const (matches implementation-plan example). No new Settings field; revisit in item 3 if needed.
Spawn-gate conservatism: if model_id is set but the GGUF doesn't exist on disk, refuse to spawn rather than burn the restart budget on a server with nothing to load. Chat fails loudly via the existing "backend not reachable" message instead.
Auto backend → no spawn: explicitly opt-in via backend = "llamacpp". Preserves pre-v0.1.2 behavior for users who never opted into supervision.
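
The spawn-gate choices above reduce to one conjunction. The sketch below restates it with hypothetical stand-in types (`Backend`, `Settings`, and the `make` helper are simplified for illustration; the real definitions live in the daemon crate) and walks the same six-case matrix the tests cover.

```rust
use std::fs;
use std::path::Path;

// Hypothetical stand-ins for the daemon's settings types.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Backend { Auto, Ollama, LlamaCpp }

struct Settings {
    backend: Backend,
    llama_server_binary: Option<String>,
    model_id: Option<String>,
}

/// Spawn only when every precondition holds; any missing piece means "do not spawn".
fn should_spawn_llama_supervisor(settings: &Settings, models_dir: &Path) -> bool {
    if settings.backend != Backend::LlamaCpp { return false; } // Auto and Ollama never spawn
    if settings.llama_server_binary.is_none() { return false; } // binary is the opt-in
    let Some(model_id) = settings.model_id.as_deref() else { return false; };
    // Refuse rather than burn the restart budget on a server with nothing to load.
    models_dir.join(format!("{model_id}.gguf")).is_file()
}

// Illustration-only constructor to keep the matrix readable.
fn make(backend: Backend, binary: Option<&str>, model: Option<&str>) -> Settings {
    Settings {
        backend,
        llama_server_binary: binary.map(String::from),
        model_id: model.map(String::from),
    }
}

fn main() -> std::io::Result<()> {
    // Throwaway models dir containing one stub GGUF file.
    let dir = std::env::temp_dir().join(format!("ocm-gate-demo-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("tiny.gguf"), b"stub")?;

    let bin = Some("/opt/llama-server"); // hypothetical path
    let cases = [
        ("yes", make(Backend::LlamaCpp, bin, Some("tiny"))),
        ("ollama", make(Backend::Ollama, bin, Some("tiny"))),
        ("auto", make(Backend::Auto, bin, Some("tiny"))),
        ("binary none", make(Backend::LlamaCpp, None, Some("tiny"))),
        ("file missing", make(Backend::LlamaCpp, bin, Some("absent"))),
        ("model_id none", make(Backend::LlamaCpp, bin, None)),
    ];
    for (name, settings) in &cases {
        println!("{name}: {}", should_spawn_llama_supervisor(settings, &dir));
    }
    fs::remove_dir_all(&dir)
}
```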

Test plan

cargo clippy --workspace --all-targets -- -D warnings — green on ubuntu/macOS/windows (run 27381788960)
cargo test --workspace — 42 ocm-daemon tests + others green on all 3 platforms (run 27381788960)
Frontend npm run check — 258 files, 0 errors (local)
Frontend Frontend CI workflow — green on Node 20 + Node 22 (run 27381591372)
Manual smoke (operator): point llama_server_binary at a real binary, download a registry model, launch cargo tauri dev, observe llama-server spawned + supervised. Not run here (no local Rust toolchain).
Manual failure-mode smoke (operator) — point at a binary that exits immediately; observe FailedAfterMaxRestarts after 3 attempts via get_supervisor_status (frontend hook is follow-up).
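
The failure-mode smoke depends on how the restart budget is counted, so here is a simplified model of that accounting. It is an assumption-laden sketch, not the supervise() loop: the `replay` function and its `Status` enum are hypothetical, there are no real processes or timers, and it assumes the budget is exhausted when the failure count reaches max_restarts and that a run outliving the stability window resets the count.

```rust
use std::time::Duration;

// Hypothetical, trimmed-down status type for the simulation.
#[derive(Debug, PartialEq)]
enum Status {
    Running,
    Restarting { attempt: u32 },
    FailedAfterMaxRestarts { attempts: u32 },
}

/// Replay a sequence of run lengths (how long each run stayed healthy before
/// dying) and report where the budget accounting ends up.
fn replay(uptimes: &[Duration], max_restarts: u32, stability_window: Duration) -> Status {
    let mut attempt = 0u32;
    for uptime in uptimes {
        // A run that outlived the stability window is not a flap: reset the count.
        if *uptime >= stability_window {
            attempt = 0;
        }
        attempt += 1;
        if attempt >= max_restarts {
            return Status::FailedAfterMaxRestarts { attempts: attempt };
        }
    }
    if attempt == 0 { Status::Running } else { Status::Restarting { attempt } }
}

fn main() {
    let window = Duration::from_secs(60);
    let crash = Duration::ZERO;

    // A binary that exits immediately, three times in a row.
    println!("{:?}", replay(&[crash, crash, crash], 3, window));
    // Two quick crashes, then 90s healthy, then one more crash.
    println!("{:?}", replay(&[crash, crash, Duration::from_secs(90), crash], 3, window));
}
```

Under these assumptions the first sequence exhausts the budget at three attempts, while the second is still restarting because the healthy 90s run cleared the earlier failures.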

Out of scope (deferred)

vLLM supervision — spawn_vllm_server helper still exists, still tested, but explicitly not wired (heavier preconditions; separate follow-up).
RunEvent::ExitRequested hook — current clean-shutdown is Drop-of-watch::Sender (sufficient; the loop notices, calls Supervisor::stop(), sets Stopped). An explicit hook + join-handle wait would be more graceful; signal_shutdown stub is ready for it.
UI status panel — Tauri command exists, no frontend display yet. Track 1 item 3 territory.
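
The Drop-based shutdown described above can be illustrated without tokio. The sketch below swaps in `std::sync::mpsc` and a thread as a stand-in for the watch channel and the async loop, so it shows the mechanism only: a loop that polls with a timeout observes the sender being dropped and exits. `supervise` and `ExitReason` here are hypothetical names, not the daemon's API.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum ExitReason {
    Signalled,     // an explicit shutdown message arrived
    SenderDropped, // the owning state was dropped
}

/// Stand-in for the supervise loop: every wait is bounded and also watches
/// the shutdown channel, so a dropped sender is noticed promptly.
fn supervise(shutdown: mpsc::Receiver<()>) -> ExitReason {
    loop {
        match shutdown.recv_timeout(Duration::from_millis(10)) {
            // No shutdown yet; a real loop would run its health check here.
            Err(RecvTimeoutError::Timeout) => continue,
            Ok(()) => return ExitReason::Signalled,
            Err(RecvTimeoutError::Disconnected) => return ExitReason::SenderDropped,
        }
    }
}

fn main() {
    // Path 1: drop the sender, the analogue of dropping LlamaSupervisorState.
    let (tx, rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || supervise(rx));
    drop(tx);
    println!("{:?}", handle.join().unwrap());

    // Path 2: an explicit signal, the analogue of the future ExitRequested hook.
    let (tx, rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || supervise(rx));
    tx.send(()).unwrap();
    println!("{:?}", handle.join().unwrap());
}
```

Joining the handle in both paths is what an explicit hook would add over plain Drop: the caller knows the loop has finished before the process exits.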

🤖 Generated with Claude Code

…wn-gate
TDD red-pass for Task 2 (Track 1 item 2). Tests reference symbols that don't yet exist; CI compile-fail IS the red. The green commit follows.
supervisor.rs:
- SupervisorStatus default = NotSpawning
- SupervisorPolicy default uses documented constants
- compute_backoff doubles then clamps at max (u8::MAX-safe)
- INTEGRATION: supervise() with immediate-exit Command + unresponsive health URL surfaces FailedAfterMaxRestarts after exactly max_restarts
- INTEGRATION: supervise() returns cleanly with Stopped on shutdown signal
settings.rs:
- Settings::default().llama_server_binary == None
- TOML round-trip for the new field
- Legacy settings.toml without the field still parses (None)
bootstrap.rs (the spawn-gate decision matrix):
- LlamaCpp + binary set + model file present → SPAWN
- backend = Ollama → DO NOT SPAWN (per directive: Ollama supervises itself)
- backend = Auto → DO NOT SPAWN (preserve pre-v0.1.2 behavior; opt-in only)
- llama_server_binary = None → DO NOT SPAWN (directive's preserve-current clause)
- model file missing on disk → DO NOT SPAWN (conservative; chat fails with the existing 'backend not reachable' message rather than burn the budget)
- model_id unset → DO NOT SPAWN
Design choices made (operator declined the multi-choice; recommended paths taken):
- Status surface = new Tauri command + tracing::error! on Failed (not a tray-icon hint; UI panel is Track 1 item 3)
- ctx_len = hardcoded 4096 const (no new setting field this PR)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…uri status
Minimal impl to satisfy the RED tests. Settings opt-in, Ollama no-spawn, clean shutdown via tokio watch channel.
supervisor.rs:
- SupervisorStatus enum (NotSpawning/Starting/Running/Restarting/FailedAfterMaxRestarts/Stopped): Serialize for Tauri command return
- SupervisorPolicy: max_restarts=3, initial_backoff=500ms, max_backoff=10s, stability_window=60s; per-test override OK
- compute_backoff(attempt, initial, max): exponential, u8::MAX-safe
- supervise(): start → wait_for_http_ready → monitor_until_dead loop with shutdown-aware tokio::select on every wait. tracing::error on FailedAfterMaxRestarts. Stability-window reset for "stable, then crashed" process.
- Module-level doc explicitly states Ollama is NOT supervised here and why (own service installer, would fight ollama-svc's restart logic).
settings.rs:
- llama_server_binary: Option<String>, #[serde(default)] for forward-compat
bootstrap.rs:
- should_spawn_llama_supervisor(settings, models_dir): the decision matrix. All five "no" branches covered by RED tests.
- build_llama_supervisor(settings, models_dir) -> Option<(Arc<Sup>, Policy)>: resolves binary + model path + health URL + port; returns None when spawn-gate refuses (so main.rs branches cleanly).
- parse_port helper (tiny; avoids pulling in `url` crate).
- LlamaSupervisorState: holds Arc<Mutex<SupervisorStatus>> + the shutdown watch::Sender (in Mutex<Option<_>> so an exit hook can .take() it later).
commands.rs:
- get_supervisor_status Tauri command: returns the live status enum (Serialize-flat, snake_case tag).
main.rs:
- Stop allowing-dead-code on supervisor (it's live now).
- Setup branch: build supervisor → spawn supervise loop on tauri runtime → manage state. NotSpawning fallback when spawn-gate refuses.
- Register get_supervisor_status in invoke_handler.
- Clean shutdown: dropping LlamaSupervisorState drops the watch::Sender, which signals the supervise loop, which calls Supervisor::stop() (kills child) and sets Stopped. Supervisor::Drop is the belt; watch is suspenders.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Format drifts caught by CI (no local rustfmt on this machine):
- main.rs: assignment line-break before match
- supervisor.rs: info!() multi-line; compute_backoff multi-arg break; assert_eq! split
Frontend:
- settings.ts: llama_server_binary field on Settings interface
- settings/+page.svelte: input row, placeholder communicates None=do-not-spawn
- npm run check clean (258 files, 0 errors)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three clippy gripes on the supervisor:
1. supervise() loop: `last_error` init value provably unread on all real paths but is sound as a sentinel; annotate #[allow(unused_assignments)] rather than restructure with Option (cheaper for a guaranteed-overwritten variable).
2. spawn_vllm_server: kept for the future NVIDIA-supervision path (already tested), not wired into bootstrap this PR. #[allow(dead_code)] with a doc comment explaining the deferred path.
3. LlamaSupervisorState.shutdown: observed via Drop semantics (dropping the watch::Sender wakes the supervise loop), not direct reads. Annotated and accompanied by a stub signal_shutdown method for the future ExitRequested hook. dead_code allow scoped to the field + method.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Brand and others added 4 commits June 11, 2026 15:55

OpenCircuitDev merged commit 757612a into main Jun 11, 2026
8 checks passed

OpenCircuitDev pushed a commit that referenced this pull request Jun 11, 2026

docs(release): v0.1.1 notes — add process supervision (#69) + recovered

84ff996

#56 verdict Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

OpenCircuitDev merged 4 commits into main from feat/process-supervision
